data.csv includes the latest-edition FIFA 2019 player attributes, such as Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause. SOURCE: https://www.kaggle.com/karangadiya/fifa19
#importing required libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline
pd.set_option('display.max_columns', 100) #--> IN ORDER TO DISPLAY ALL COLUMNS IN THE DATASET.
df = pd.read_csv(r"E:\Data_Science_Journey\Course Notes\FWD\Data Visualization\archive\data.csv", index_col= 0)
df
In this section I will dig into the data, looking for missing values, duplicates, and anything else that needs cleaning.
Here I will drop all the columns I do not expect to use in my analysis.
df.columns
columnsOfInterest = ['ID', 'Name', 'Age', 'Nationality', 'Overall',
'Potential', 'Club', 'Value', 'Wage', 'Special',
'Preferred Foot', 'International Reputation', 'Weak Foot',
'Skill Moves', 'Work Rate', 'Position','Height', 'Weight', 'Crossing',
'Finishing', 'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling',
'Curve', 'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
'Marking', 'StandingTackle', 'SlidingTackle']
#--> Here is our new dataset.
df = df[columnsOfInterest]
df
df.isnull().sum()
Since there are only a few missing values, let's drop them.
df = df.dropna(axis = 0) #--> reassign instead of inplace to avoid modifying a slice of the original frame.
df.info()
Cool!
df.duplicated().sum()
No duplicated rows found!
Here is a crucial point for further analysis. In the money-related columns, values are written with 'K' for thousands and 'M' for millions of euros. Next, I will convert them into numeric values so we can use them later in the analysis.
def valueConverter(dataframe, col, currency = "€"):
    '''Take a dataframe and the name of a money column stored as strings, and return
    the dataframe with that column converted to numbers.
    Note: rows whose value has neither a "K" nor an "M" suffix are dropped.'''
    dataframe[col] = dataframe[col].str.replace(currency, "")
    inKs = dataframe[dataframe[col].str.contains('K')].copy() #--> taking values described in thousands.
    inMs = dataframe[dataframe[col].str.contains("M")].copy() #--> taking values described in millions.
    inKs[col] = inKs[col].str.replace("K", "").astype(float) * 1000
    inMs[col] = inMs[col].str.replace("M", '').astype(float) * 1000000
    return pd.concat([inKs, inMs], ignore_index = True)
#--> Let's for now convert Value and Wage columns.
df = valueConverter(df, 'Wage')
df = valueConverter(df, "Value")
df
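The conversion above keeps only the rows whose value contains 'K' or 'M'. As a minimal sketch of an alternative, the function below parses a single string and also accepts values without a suffix. `parse_money` is a hypothetical helper and the sample strings are illustrative; neither comes from the notebook or from data.csv.

```python
def parse_money(text, currency="€"):
    """Convert a string such as '€1.5M' or '€200K' to a float amount."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    amount = text.replace(currency, "").strip()
    # A trailing K or M scales the number; anything else is read as-is.
    if amount and amount[-1] in multipliers:
        return float(amount[:-1]) * multipliers[amount[-1]]
    return float(amount)

print(parse_money("€1.5M"))
print(parse_money("€200K"))
print(parse_money("€0"))
```

Under these assumptions it could be applied to a column with `df['Wage'].map(parse_money)`, which keeps the original row order and does not drop any rows.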
The Height and Weight columns also need cleaning to put them in a format suitable for analysis.
df.Height = df.Height.str.replace("'", ".").astype(float) * 30.48 #Formula: multiply the length in feet by 30.48 to convert it into cm. Caveat: this reads 5'10 as 5.10 feet, so the inches part is only approximated.
df.Weight = df.Weight.str.replace('lbs', "").astype(float) #--> Let's leave this in pounds.
df
All set!
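One thing to keep in mind about the height conversion: replacing the apostrophe with a decimal point treats the inches as a decimal fraction of a foot. A sketch of a conversion that handles feet and inches separately could look like this. `feet_inches_to_cm` is a hypothetical helper, and it assumes every height is written as feet'inches.

```python
def feet_inches_to_cm(height):
    """Convert a height written as feet'inches (e.g. "5'10") to centimetres."""
    feet, inches = height.split("'")
    # 1 foot = 30.48 cm and 1 inch = 2.54 cm.
    return int(feet) * 30.48 + int(inches) * 2.54

# Compare the two readings of the same height.
print(round(feet_inches_to_cm("5'10"), 2))   # feet and inches handled separately
print(round(float("5.10") * 30.48, 2))       # inches read as decimal feet
```

Under these assumptions it could replace the string substitution with `df.Height.map(feet_inches_to_cm)`, applied to the raw Height strings.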
In this section my analysis will focus on single variables, looking at distributions and rankings.
Let's start with an overview of how ages are distributed among professional soccer players.
df.Age.describe()
bins = range(15, 46, 5)
plt.hist(data = df, x = 'Age', bins = bins, ec = 'black', color = 'orange');
plt.title("Players' Age Distribution");
plt.xlabel("Age");
plt.ylabel("Frequency");
Well, most players are between 20 and 25 years old.
Say we have a child and want to check whether becoming a soccer player would be a good career choice. Getting an idea of the wages would be helpful here.
df.Wage.describe()
bins = np.logspace(3, 8, 30)
plt.hist(data = df, x = "Wage", bins = bins, ec = 'white')
plt.xscale('log')
plt.xlim(1000, 1e6)
xticks = np.logspace(3, 8, 30)
xlabels = ["{:.01f}".format(x) for x in xticks]
plt.xticks( xticks, xlabels, rotation = -90);
plt.title("Wages Distribution");
plt.xlabel("Wages[In Euros]");
plt.ylabel("Frequency");
Well, maybe this wasn't that insightful, because most players are paid around 1,000 or 7,200. So that child should be well prepared to excel in order to get paid in millions!
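The wage histogram uses logarithmically spaced bins so that each bin covers the same width on a log axis. Here is a small sketch of how such bins count values, using made-up wages rather than values from data.csv:

```python
import numpy as np

# Toy wages, invented for illustration only.
wages = np.array([1_500, 2_000, 5_000, 7_000, 30_000, 250_000])

# Four edges, one per power of ten from 10^3 to 10^6, giving three bins.
bins = np.logspace(3, 6, 4)
counts, edges = np.histogram(wages, bins=bins)
print(counts)
```

Each bin spans one decade, so skewed data such as wages does not collapse into the first bar. Using a handful of tick positions like these edges, rather than one tick per bin, may also keep the x-axis labels readable.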
Now let's take a look at height and weight distributions.
df.Weight.describe()
bins = range(100, 245, 5)
plt.hist(data = df, x = 'Weight', bins = bins, ec = 'black', color = 'orange');
plt.title("Weight Distribution");
plt.xlabel("Weight[lbs]");
plt.ylabel("Frequency");
Most weights cluster around 150 and 170 lbs.
df.Height.describe()
bins = range(150, 215, 5)
plt.hist(data = df, x = 'Height', bins = bins, ec = 'black', color = 'orange');
plt.title("Height Distribution");
plt.xlabel("Height[cms]");
plt.ylabel("Frequency");
We have over 4,000 players with heights around 160 cm, while many others are between 185 and 190 cm. Keep in mind that the conversion used during cleaning reads a height such as 5'10 as 5.10 feet, so these values are approximate.
I like top 10s, so let's build some for players and clubs!
First, based on Overall scores!
top10players = df[['Name', 'Overall']].sort_values(['Overall'], ascending = False)[:10]
sb.barplot(data = top10players, y = 'Name', x = "Overall");
plt.xlim(90, 95);
plt.title('Top 10 Performing Players');
Well, it looks like Messi and Ronaldo, followed by Neymar, have the best performances.
top10valued = df[['Name', 'Value']].sort_values(['Value'], ascending = False)[:10]
sb.barplot(data = top10valued, y = "Name", x = "Value");
plt.xlim(.5e8, 1.3e8);
plt.title("TOP 10 VALUED PLAYERS")
Is that why Neymar always changes his hair color?
Now based on each club's mean Overall.
top10clubs = df[['Club', "Overall"]].groupby("Club").mean().sort_values('Overall',ascending = False)[:10].reset_index()
sb.barplot(data = top10clubs, y = 'Club', x = 'Overall');
plt.xlim(70, 85);
plt.title("TOP 10 PERFORMING CLUBS");
Oh, Italian clubs dominate this ranking!
Top paying clubs
top10paying = df[['Club', "Wage"]].groupby("Club").mean().sort_values("Wage", ascending = False).reset_index()[:10]
plt.bar(top10paying.Club, top10paying.Wage)
plt.xticks(rotation = -90);
plt.title('TOP 10 PAYING CLUBS');
Well, there might be some outliers here, so let's leave this for now!
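The ranking above uses the mean wage per club, so a few very high earners can pull a club up. A quick way to see that effect is to put the mean next to the median. This sketch uses invented wages for two hypothetical clubs, not values from data.csv:

```python
import pandas as pd

# Toy wages, invented for illustration: club A has one very high earner.
toy = pd.DataFrame({
    "Club": ["A", "A", "A", "B", "B", "B"],
    "Wage": [10_000, 12_000, 500_000, 40_000, 45_000, 50_000],
})

# Mean and median wage per club, side by side.
summary = toy.groupby("Club")["Wage"].agg(["mean", "median"])
print(summary)
```

In this toy case club A ranks first by mean but last by median, which is the kind of distortion a single outlier can cause.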
In this section I will focus on finding relationships between pairs of variables.
Let's begin by checking the relationship between Overall performance and the different abilities with a huge grid of scatter plots!
df.columns
vars_ = ['Overall', 'Crossing', 'Finishing',
'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve',
'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
'Marking', 'StandingTackle', 'SlidingTackle']
len(vars_)
fig, axes = plt.subplots(nrows= 10, ncols= 3, figsize = (20, 60))
vars_ = np.array(vars_).reshape(10, 3)
for i in range(10):
    for j in range(3):
        axes[i][j].scatter(data = df,y = "Overall",x= vars_[i][j], alpha = .1)
        axes[i][j].set_xlabel(vars_[i][j])
        axes[i][j].set_ylabel("Overall")
From the plots above we can see a strong positive relationship between Overall performance and both Reactions and Composure, and a moderate positive relationship between Overall performance and ShortPassing, LongPassing, BallControl, ShotPower, and Vision.
plt.figure(figsize = (20, 20))
sb.heatmap(df[['Overall', 'Crossing', 'Finishing',
'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve',
'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
'Marking', 'StandingTackle', 'SlidingTackle']].corr(), annot= True);
plt.title("Correlations between pairs of soccer abilities")
A close look at the heatmap above also shows how strongly each pair of variables is related.
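With 30 variables the annotated heatmap is dense, so it can help to pull out a single column of the correlation matrix and sort it. Here is a minimal sketch with made-up ratings, not values from data.csv:

```python
import pandas as pd

# Toy ratings, invented for illustration only.
toy = pd.DataFrame({
    "Overall":   [60, 65, 70, 75, 80],
    "Reactions": [58, 66, 69, 77, 82],
    "Balance":   [70, 55, 80, 60, 65],
})

# Correlation of every ability with Overall, strongest first.
ranked = toy.corr()["Overall"].drop("Overall").sort_values(ascending=False)
print(ranked)
```

The same line applied to the ability columns of `df` would list the abilities most correlated with Overall at the top.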
What about the same approach for the same variables vs. Value?
fig, axes = plt.subplots(nrows= 10, ncols= 3, figsize = (20, 60))
vars_ = np.array(vars_).reshape(10, 3)
for i in range(10):
    for j in range(3):
        axes[i][j].scatter(data = df,y = "Value",x= vars_[i][j], alpha = .1)
        axes[i][j].set_xlabel(vars_[i][j])
        axes[i][j].set_ylabel("Value")
        axes[i][j].set_yscale("log")
There is a really strong relationship between Value and Overall, Composure, Reactions, BallControl, and Finishing, and a weak to moderate relationship with the other abilities!
Let's now check the relationship between Overall and Height, then Overall and Weight.
plt.figure(figsize = (20, 8))
plt.subplot(1, 2, 1)
plt.hist2d(data = df, x = 'Height', y = 'Overall');
plt.title("Effect of Height on the Overall Performance");
plt.colorbar(label = "Density")
plt.subplot(1, 2, 2)
plt.hist2d(data = df, x = 'Weight', y = 'Overall');
plt.colorbar(label = 'Density');
plt.title("Effect of Weight on the Overall Performance");
sb.heatmap(df[["Overall", "Height", "Weight"]].corr(), annot= True);
plt.title("Correlation Coefficients Between Heights, Weights, And overall")
From these heatmaps we can see that the relationships between Overall and Weight and between Overall and Height are weak. Still, a noticeable group of players around 185 cm tall has an Overall rating of 60 to 70, and another group weighing 160 to 180 lbs falls in the same rating range!
Now let's take the subset of players in the top 10 performing clubs and study them further!
top10clubs
top10clubs.Club.values
mask = df.Club.isin(top10clubs.Club) #--> True for players whose club is one of the top 10.
subDF = df[mask]
subDF
Let's see how different features vary among these clubs.
sb.pointplot(data = subDF, x = 'Club', y= 'Overall', order= top10clubs.Club);
plt.xticks(rotation = -90);
plt.title("TOP 10 PERFORMING CLUBS")
We obtained this before, so let's dig deeper.
sb.pointplot(data = subDF, x = 'Club', y= 'Potential', order= top10clubs.Club);
plt.xticks(rotation = -90);
plt.title("POTENTIAL LEVEL OF TOP 10 PERFORMING CLUBS");
Here Barcelona is a strong competitor to Juventus when it comes to Potential!
Let's see which club pays better. With a box plot, outliers won't distract us!
sb.boxplot(data = subDF, x = 'Club', y= 'Wage', order= top10clubs.Club);
plt.xticks(rotation = -90)
jev_med = subDF[subDF.Club == "Juventus"].Wage.median()
plt.axhline(y = jev_med, color = 'red');
plt.title("WAGES 5 NUMBERS SUMMARY FOR TOP 10 PERFORMING CLUBS");
Juventus, Barcelona, and Real Madrid pay similar median wages!
Let's check how their heights and weights vary!
sb.boxplot(data = subDF, x = 'Club', y= 'Weight', order= top10clubs.Club);
plt.xticks(rotation = -90);
plt.title("HOW WEIGHTS CHANGE FOR TOP 10 PERFORMING CLUBS")
Players who play for Barcelona seem to be on a strict diet!
sb.boxplot(data = subDF, x = 'Club', y= 'Height', order= top10clubs.Club);
plt.xticks(rotation = -90);
plt.title("HOW HEIGHTS CHANGE FOR TOP 10 PERFORMING CLUBS")
Players in Milan have a wide range of heights!
Here is a fun one: say we are about to play a penalty shootout. Which team is more likely to win?
sb.violinplot(data = subDF, x = 'Club', y= 'Penalties', order= top10clubs.Club);
jev_med = subDF[subDF.Club == "Juventus"].Penalties.median()
plt.axhline(y = jev_med, color = 'red');
plt.xticks(rotation = -90);
plt.title("WHICH CLUB WOULD WIN A PENALTIES GAME");
Actually, they all have similar median values, but Juventus has the highest, followed by Bayern and Barcelona!
df
Let's check the relationship between Position and Overall with respect to Preferred Foot!
plt.figure(figsize =(20, 8))
sb.pointplot(data = df, x = "Position", y = 'Overall', hue = "Preferred Foot", dodge = .2, linestyles = "");
plt.title("Overall Performance Change with Position with respect to Preferred Foot")
Well, left-footed players in the Left Forward position show higher Overall values!
Let's check the same for Shot Power
plt.figure(figsize =(20, 8))
sb.pointplot(data = df, x = "Position", y = 'ShotPower', hue = "Preferred Foot", dodge = .2, linestyles = "");
plt.title("Shooting Power Change with Position with respect to Preferred Foot");
Forwards have the highest shot power, whether they are left-footed or right-footed!
Let's check how Overall changes with respect to preferred foot and position!
plt.figure(figsize = (15, 10))
cat_means = df.groupby(['Preferred Foot', 'Position'])['Overall'].mean() #--> select the column first so text columns are never averaged.
cat_means = cat_means.reset_index(name = 'Overall')
cat_means = cat_means.pivot(index = 'Position', columns = 'Preferred Foot',
values = 'Overall');
sb.heatmap(cat_means, annot = True, fmt = '.3f',
cbar_kws = {'label' : 'mean(Overall)'});
plt.title("How Overall performance is distributed among different positions and footedness")
From here we can suggest which foot suits each position based on the mean Overall. For example, a left-footed player fits LAM, while a right-footed one fits RWB, although the differences are small.
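The groupby, reset_index, and pivot steps can also be written as a single `pivot_table` call, and `idxmax` then reads off the foot with the higher mean for each position. This is a sketch on invented ratings, not values from data.csv:

```python
import pandas as pd

# Toy ratings, invented for illustration only.
toy = pd.DataFrame({
    "Position": ["LAM", "LAM", "RWB", "RWB"],
    "Preferred Foot": ["Left", "Right", "Left", "Right"],
    "Overall": [72, 70, 64, 66],
})

# Mean Overall for each (Position, Preferred Foot) pair, as a 2D table.
cat_means = toy.pivot_table(index="Position", columns="Preferred Foot",
                            values="Overall", aggfunc="mean")

# For each position, the foot whose mean Overall is higher.
best_foot = cat_means.idxmax(axis=1)
print(best_foot)
```

The resulting `cat_means` table has the same shape as the one passed to the heatmap, so it could be plotted the same way.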
ir = pd.api.types.CategoricalDtype([5, 4, 3, 2, 1], ordered= True)
df["International Reputation"] = df["International Reputation"].astype(ir)
ir = pd.api.types.CategoricalDtype([1, 2, 3, 4, 5], ordered= True)
df["Skill Moves"] = df["Skill Moves"].astype(ir)
Let's see the relationship among International Reputation, Skill Moves, and Overall.
plt.figure(figsize = (15, 10))
cat_means = df.groupby(['Skill Moves', 'International Reputation'])['Overall'].mean() #--> select the column first so text columns are never averaged.
cat_means = cat_means.reset_index(name = 'Overall')
cat_means = cat_means.pivot(index = 'International Reputation', columns = 'Skill Moves',
values = 'Overall')
sb.heatmap(cat_means, annot = True, fmt = '.3f',
cbar_kws = {'label' : 'mean(Overall)'});
plt.title("How International Reputation, Skill Moves, and Overall affect each other!")
Here we can see how International Reputation increases as Skill Moves and Overall increase.
grid2 = sb.FacetGrid(data = df, hue = 'Skill Moves');
grid2.map(sb.regplot, 'Overall', "Value", x_jitter = .5, fit_reg = False, scatter_kws = {"alpha":.4})
figure = plt.gcf()
figure.set_size_inches(12, 8);
# plt.xlim(70, 100)
plt.yscale('log')
plt.legend(title = 'Skill Moves Rate');
plt.title('How the Overall rating relates to the player value');
Players with a Skill Moves rating of 4 or 5 have high Overall and Value, which is intuitive!
Let's check how height and weight affect the player's Stamina.
plt.figure(figsize = (12, 8))
plt.scatter(data = df, x = 'Weight', y = 'Height', c = 'Stamina');
plt.colorbar(label = 'Stamina Rate');
plt.title("How Weight and Height affect the player's Stamina");
plt.xlabel('Weight[lbs]')
plt.ylabel("Height[cms]");
Well, Stamina tends to decrease as height and weight increase!
Finally, let's check the relationship between Heading Accuracy, Height, and Jumping.
df.columns
plt.scatter(data = df, y = 'HeadingAccuracy', x = 'Jumping',c = 'Height', cmap= 'viridis_r');
plt.colorbar(label = 'Height[cms]');
plt.title("Effect of Jumping on Heading Accuracy with respect to the player's height");
plt.xlabel("Jumping Rate")
plt.ylabel("Heading Accuracy Rate");
Oh, it looks like there is no clear pattern between them!
This section covers the relationships and insights obtained from the previous exploratory step. Most of the graphs here are the same as above, but polished and organized to be easy to interpret!
plt.figure(figsize = (12 , 8))
bins = range(15, 46, 5)
n, bins, patches = plt.hist(data = df, x = 'Age', bins = bins, ec = 'black');
patches[1].set_color("#ef4f4f");
patches[1].set_edgecolor("black");
plt.title("Players' Age Distribution");
plt.xlabel("Age");
plt.ylabel("Frequency");
Well, most players are between 20 and 25 years old.
plt.figure(figsize = (12 , 8));
bins = range(100, 245, 5)
n, bins, patches = plt.hist(data = df, x = 'Weight', bins = bins, ec = 'black');
plt.title("Players' Weights Distribution");
plt.xlabel("Weight[lbs]");
plt.ylabel("Frequency");
It seems the weights follow a bimodal distribution, with peaks around 150 lbs and 175 lbs.
plt.figure(figsize = (12 , 8))
bins = range(150, 215, 5)
plt.hist(data = df, x = 'Height', bins = bins, ec = 'black');
plt.title("Players' Heights Distribution");
plt.xlabel("Height[cms]");
plt.ylabel("Frequency");
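We have over 4,000 players with heights around 160 cm, while many others are between 185 and 190 cm. Keep in mind that the conversion used during cleaning reads a height such as 5'10 as 5.10 feet, so these values are approximate.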
plt.figure(figsize = (12 , 8))
top10players = df[['Name', 'Overall']].sort_values(['Overall'], ascending = False)[:10]
sb.barplot(data = top10players, y = 'Name', x = "Overall", color = sb.color_palette()[0]);
plt.xlim(85, 95);
plt.title("TOP 10 PLAYERS IN FIFA 19");
plt.xlabel("Overall Performance Rate");
plt.ylabel("Player's Name");
Ronaldo and Messi come first, followed by Neymar Jr.!
plt.figure(figsize = (12 , 8))
top10valued = df[['Name', 'Value']].sort_values(['Value'], ascending = False)[:10]
sb.barplot(data = top10valued, y = "Name", x = "Value", color= sb.color_palette()[0]);
plt.xlim(.5e8, 1.3e8)
xticks = np.arange(50e6, 150e6, 10e6)
xlabels = ['{:.0f}'.format(x) for x in xticks]
plt.xticks(xticks, xlabels)
plt.title("TOP 10 VALUED PLAYERS IN FIFA 19");
plt.xlabel("Value [Euros]");
plt.ylabel("Player's Name");
plt.grid()
Unsurprisingly, Neymar has the highest value, followed by Messi and De Bruyne.
plt.figure(figsize = (12 , 8))
top10clubs = df[['Club', "Overall"]].groupby("Club").mean().sort_values('Overall',ascending = False)[:10].reset_index()
sb.barplot(data = top10clubs, y = 'Club', x = 'Overall', color= sb.color_palette()[0]);
plt.xlim(70, 85);
plt.title("TOP 10 CLUBS IN FIFA 19");
plt.xlabel("Overall Performance Rate");
plt.ylabel("Club");
Five Italian clubs are in the top 10: Juventus comes first, followed by Napoli and Inter.
plt.figure(figsize= (20, 12))
plt.subplot(1,3,1)
plt.scatter(data = df, x = 'Reactions', y = 'Overall')
plt.title("Reactions Vs Overall Performance", fontsize = 14, weight = "bold")
plt.xlabel("Reactions Rate", fontsize = 14, weight = "bold")
plt.ylabel("Overall Performance", fontsize = 14, weight = "bold");
#--
plt.subplot(1,3,2)
plt.scatter(data = df, x = 'Composure', y = 'Overall')
plt.title("Composure Vs Overall Performance", fontsize = 14, weight = "bold")
plt.xlabel("Composure Rate", fontsize = 14, weight = "bold")
plt.ylabel("Overall Performance", fontsize = 14, weight = "bold");
#--
plt.subplot(1,3,3)
plt.scatter(data = df, x = 'Vision', y = 'Overall')
plt.title("Vision Vs Overall Performance", fontsize = 14, weight = "bold")
plt.xlabel("Vision Rate", fontsize = 14, weight = "bold");
plt.ylabel("Overall Performance", fontsize = 14, weight = "bold");
plt.suptitle('Relation Between Soccer abilities and Overall Performance', fontsize = 16, weight = "bold");
Reactions, Composure, and Vision all show a positive relationship with Overall performance.
plt.figure(figsize = (12, 8))
sb.boxplot(data = subDF, y = 'Club', x= 'Wage', order= top10clubs.Club, color = sb.color_palette()[0]);
jev_med = subDF[subDF.Club == "Juventus"].Wage.median()
plt.axvline(x = jev_med, color = 'red');
plt.title("Five-number summary of wages for the top 10 performing clubs", fontsize = 14, weight = "bold");
plt.ylabel("Club", fontsize = 14, weight = "bold");
plt.xlabel("Amount[euros]", fontsize = 14, weight = "bold");
Juventus comes first, followed by Barcelona and Real Madrid, while Roma comes last!
plt.figure(figsize = (20, 20))
plt.title("Correlations among all soccer abilities")
sb.heatmap(df[['Overall', 'Crossing', 'Finishing',
'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve',
'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
'Marking', 'StandingTackle', 'SlidingTackle']].corr(), annot= True, cmap= "viridis_r");
plt.figure(figsize =(20, 8))
sb.pointplot(data = df, x = "Position", y = 'Overall', hue = "Preferred Foot", dodge = .2, linestyles = "");
plt.title("Position vs. Overall Performance", fontsize = 14, weight = "bold");
plt.xlabel("Position", fontsize = 14, weight = "bold");
plt.ylabel("Overall Performance", fontsize = 14, weight = "bold");
Left-footed players in the Left Forward position show higher performance values!
plt.figure(figsize =(20, 8))
sb.pointplot(data = df, x = "Position", y = 'ShotPower', hue = "Preferred Foot", dodge = .2, linestyles = "")
plt.title("Position vs. Shooting Power", fontsize = 14, weight = "bold");
plt.xlabel("Position", fontsize = 14, weight = "bold");
plt.ylabel("Shooting Power", fontsize = 14, weight = "bold");
Forwards have higher shot power than midfielders, but footedness makes little difference!
plt.figure(figsize = (15, 10))
cat_means = df.groupby(['Skill Moves', 'International Reputation'])['Overall'].mean() #--> select the column first so text columns are never averaged.
cat_means = cat_means.reset_index(name = 'Overall')
cat_means = cat_means.pivot(index = 'International Reputation', columns = 'Skill Moves',
values = 'Overall')
sb.heatmap(cat_means, annot = True, fmt = '.3f',
cbar_kws = {'label' : 'mean(Overall)'});
plt.title("Skill Moves vs. International Reputation vs. Overall Performance", fontsize = 14, weight = "bold");
International Reputation, Skill Moves, and Overall performance are strongly related!
plt.figure(figsize = (15, 10))
plt.scatter(data = df, x = 'Weight', y = 'Height', c = 'Stamina')
plt.colorbar(label = 'Stamina Rate');
plt.title("Weight vs. Height vs. Stamina", fontsize = 14, weight = "bold");
plt.xlabel("Weight[lbs]");
plt.ylabel("Height[cms]");
We can see that low Stamina ratings appear at high weights and heights!